Abstract. Low reliability and availability of public SPARQL endpoints prevent
real-world applications from exploiting the full potential of these querying
infrastructures. Fragmenting data on servers can improve data availability but degrades
performance. Replicating fragments can offer a new tradeoff between performance
and availability. We propose Fedra, a framework for querying Linked Data that
takes advantage of client-side data replication, and performs a source selection
algorithm that aims to reduce the number of selected public SPARQL endpoints,
execution time, and intermediate results. Fedra has been implemented on the
state-of-the-art query engines ANAPSID and FedX, and empirically evaluated
on a variety of real-world datasets.


Keywords: SPARQL Federation, Replicated Fragments, Source Selection 


1 Introduction 

Linked Data [4] provides millions of triples for data consumers; however, recent
studies suggest that data availability is currently the main bottleneck to the success of
the Semantic Web as a viable technology [2,19]. In particular, Buil-Aranda et al. report
that only one third of the 427 public SPARQL endpoints have an availability rate equal
to or greater than 99%; this limitation is the major obstacle to developing real-world
Web applications that access Linked Data through these infrastructures. Recently, the
Linked Data Fragments (LDF) approach [20,19] has addressed availability issues by
delegating query processing to clients, and by transforming public endpoints into
simple HTTP-based providers of triple-pattern fragments that can be easily cached by
clients. This tradeoff effectively improves data availability, but it can significantly
degrade performance [20]. However, we speculate that fragment caching approaches
like LDF can have two important consequences for consuming Linked Data: (i) Each
client is able to process SPARQL queries on replicated fragments cached from servers.
Consequently, if clients are ready to cooperate, the cost of executing SPARQL queries
can be significantly reduced, and a new compromise between availability and
performance can be achieved. (ii) Potentially, clients could cache triple-pattern
fragments from different data providers, creating new localities for federated queries.
Therefore, some joins can now be executed on one machine without contacting the
public endpoints. This vision is also clearly proposed by Ibáñez [9],
Table 1: Execution time and results for the same query over one and two replicas of
DBpedia for FedX and ANAPSID 

where triple-pattern fragments can be replicated from public SPARQL endpoints,
modified locally, and made available through consumer data endpoints. The approaches
in [9,20] demonstrate how SPARQL processing resources and simple triple-based
fragments can be obtained from data consumers. We believe this represents a new
opportunity for federated query processing engines to improve SPARQL query
processing performance by taking advantage of opportunistic replication and SPARQL
processing offered by data consumers.

However, current SPARQL federated query engines [1,3,7,14,17] may exhibit poor
performance in the presence of replication. As presented in Table 1, we duplicated
DBpedia and executed a query with three triple patterns against one instance and then
against two instances of DBpedia. We can observe that the performance in terms of
execution time and number of results is seriously degraded. This problem has been
partially addressed by recent duplicate-aware source selection strategies [8,15]. The
proposed solutions rely on summaries of datasets to detect overlap and do not consider
fragments [9,19,20]. With fragments, replication is defined declaratively and does not
need to be detected.

In this paper, we propose Fedra, a source selection strategy that exploits fragment
definitions to select non-redundant data sources. In contrast to [8,15], Fedra does not
require information about the content of the data sources to detect overlap. Fedra
relies only on knowledge about the fragments replicated at each endpoint to reduce the
number of endpoints to be contacted, and to delegate join execution to endpoints. Fedra
implements a set covering heuristic to minimize the number of sources to be contacted.
The implemented source selection approach ensures that triple patterns in the
same basic graph pattern are assigned to the same endpoints, and consequently, it
reduces the size of intermediate results. We extend the state-of-the-art federated query
engines FedX [17] and ANAPSID [1] with Fedra, and compare these extensions with
the original engines. Our empirical study of these engines suggests that Fedra
effectively reduces the number of selected public and replicated endpoints, and the size
of intermediate results. The paper is organized as follows: Section 2 presents related
work. Section 3 describes Fedra and the source selection algorithm. Section 4 reports
our experimental results. Finally, conclusions and future work are outlined in Section 5.

2 Related Work 

In distributed databases, data fragmentation and replication improve data availability
and query performance [13]. Linked Data [4] is intrinsically a federation of autonomous
participants where federated queries are unknown to any single participant, and tight
coordination of data providers is difficult to achieve. This makes the data fragmentation
[12] and distributed query processing [10] techniques of distributed databases not a
viable solution [18] for Linked Data.




















Recently, the Linked Data Fragments (LDF) approach [20,19] proposed to improve
Linked Data availability by moving query execution load from servers to clients. A
client is able to execute a restricted SPARQL query locally by downloading the
fragments required to execute the query from an LDF server through simple HTTP
requests. This strategy allows clients to cache fragments locally and decreases the load
on the LDF server. A Linked Data Fragment of a Linked Data dataset is a resource
consisting of the dataset triples that match a specific selector. A triple pattern fragment
is a special kind of fragment where the selector is a triple pattern; a triple pattern
fragment minimizes the effort required from the server to produce the fragment. LDF
chooses a clear tradeoff by shifting query processing to clients, at the cost of slower
query execution [20]. On the other hand, LDF could create many data consumer
resources that hold replicated fragments in their caches and are able to process SPARQL
queries. This opens the opportunity to use these new resources to process federated
SPARQL queries. Fedra aims to improve the source selection algorithms of federated
query engines so that they consider these new endpoints and decrease the load on
public endpoints.

Col-graph [9] enables data consumers to materialize triple pattern fragments and to
expose them through SPARQL endpoints to improve data quality. A data consumer can
update her local fragments and share updates with data providers and consumers.
Col-graph proposes a coordination-free protocol to maintain the consistency of
replicated fragments. Compared to LDF, Col-graph clearly creates SPARQL endpoints
available to other data consumers, and allows federated query engines to use local
fragments. As with LDF, Fedra can take advantage of these data consumer resources.

Recently, HiBISCuS [16], a source selection approach, has been proposed to reduce
the number of selected sources. The reduction is achieved by annotating sources with
the URI authorities they contain, and pruning sources that cannot have triples that
match any of the query triple patterns. The aim of HiBISCuS differs from ours, which is
both to select the sources that are required to produce the answer and to avoid selecting
sources that only provide redundant replicated fragments. While not directly related to
replication, the HiBISCuS index could be used in conjunction with Fedra to perform
join-aware source selection in the presence of replicated fragments.

Existing federated query engines [1,3,7,14,17] are not able to take advantage of
replicated fragments, and data overlap can seriously degrade their performance, as
reported in Table 1 and shown in [8,15]. We integrated Fedra within FedX and
ANAPSID to make existing engines aware of replicated fragments. With Fedra,
replications as in Table 1 will be detected, and performance will remain stable.

Recently, QBB [8] and DAW [15] proposed duplicate-aware strategies for selecting
sources for federated query engines. Both approaches use sketches to estimate the
overlap among sources. DAW uses a combination of Min-Wise Independent
Permutations (MIPs) and triple selectivity information to estimate the overlap between
the results of different sources. Based on how many new query results are expected to
be found, sources below a predefined benefit threshold are discarded and not queried.

Compared to DAW, Fedra does not need to compute data summaries because
it relies on fragment definitions to deduce containments. Computing containments
based on fragment descriptions is less expensive than computing data summaries;
moreover, data updates are more frequent than fragment description updates. Fedra








[Figure 1(a): public endpoints P1 and P2, and consumer endpoints C1-C5 holding
replicated fragments. Triples shown for each endpoint:]

P1 Endpoint:  t1 p1 c1 .  t1 p1 c2 .  c1 p2 11 .  c2 p2 06 .  c1 p4 d1 .  c2 p4 d2
P2 Endpoint:  d1 p3 g1 .  d2 p3 g2 .  d1 p5 c2 .  d2 p5 c3 .  d1 p6 n1 .  d2 p6 n2 .
              d1 p7 m .  d2 p7 e
C1 Endpoint:  t1 p1 c1 .  t1 p1 c2 .  c1 p2 11 .  c2 p2 06 .  d1 p3 g1 .  d2 p3 g2
C2 Endpoint:  c1 p4 d1 .  c2 p4 d2 .  d1 p5 c2 .  d2 p5 c3 .  d1 p6 n1 .  d2 p6 n2
C3 Endpoint:  t1 p1 c1 .  t1 p1 c2 .  c1 p4 d1 .  c2 p4 d2 .  d1 p7 m
C4 Endpoint:  c1 p2 11 .  c2 p2 06 .  d1 p5 c2 .  d2 p5 c3 .  d2 p7 e
C5 Endpoint:  d1 p3 g1 .  d2 p3 g2 .  d1 p6 n1 .  d2 p6 n2 .  t1 p1 c1

(a) Public and Consumer Endpoints Federation 
(b) Fragments Definitions


Fig. 1: Figure (a) describes how fragments of public endpoints are replicated at consumer
endpoints. Table (b) describes the selector of each fragment, its “authoritative” source (AS),
and the endpoints where the fragment is available


also tries to remove public endpoints and to minimize the number of endpoints needed
to execute a query. Consequently, even if DAW and Fedra find the same number of
sources to execute a query, Fedra minimizes the number of public endpoints to be
contacted. Additionally, Fedra source selection considers the basic graph patterns of the
query in order to delegate join execution to the endpoints and reduce the size of
intermediate results. This key feature cannot be achieved by DAW, as it performs source
selection only at the triple pattern level.


3 Fedra Approach 

Fragments and Endpoints Descriptions. To define a fragment, we use the Linked
Data Fragment definition given by Verborgh et al. [20]. Let $U$, $L$, and $V$ denote the
sets of all URIs, literals, and variables, respectively. $T^* = U \times U \times (U \cup L)$ is
the set of blank-node-free RDF triples. Every dataset $G$ published via some kind of
fragments on the Web is a finite set of blank-node-free triples, i.e., $G \subseteq T^*$. Any tuple
$tp \in (U \cup V) \times (U \cup V) \times (U \cup V \cup L)$ is a triple pattern.

Definition 1 (Fragment [19]). A Linked Data Fragment (LDF) of $G$ is a tuple
$f = (u, s, \Gamma, M, C)$ with the following five elements: i) $u$ is a URI (which is the
“authoritative” source from which $f$ can be retrieved); ii) $s$ is a selector; iii) $\Gamma$ is a
set consisting of all subsets of $G$ that match selector $s$, that is, for every $G' \subseteq G$ it
holds that $G' \in \Gamma$ if and only if $G' \in dom(s)$ and $s(G') = true$; iv) $M$ is a finite set
of (additional) RDF triples, including triples that represent metadata for $f$; and v) $C$ is
a finite set of controls.
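
As an illustration of Definition 1 restricted to triple pattern fragments, the following Python sketch models a fragment by its “authoritative” source u and its selector s. The class name `Fragment`, the helper names, and the convention that terms starting with `?` are variables are assumptions made for this example; M and C are omitted, and the fragment data is approximated by the set of matching triples.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Tuple

Triple = Tuple[str, str, str]


def is_var(term: str) -> bool:
    """Assumed convention: variables are written with a leading '?'."""
    return term.startswith("?")


def matches(pattern: Triple, triple: Triple) -> bool:
    """True if the triple is an instance of the triple pattern."""
    binding: Dict[str, str] = {}
    for p, t in zip(pattern, triple):
        if is_var(p):
            # A repeated variable must bind to the same term everywhere.
            if binding.setdefault(p, t) != t:
                return False
        elif p != t:
            return False
    return True


@dataclass(frozen=True)
class Fragment:
    source: str       # u: URI of the "authoritative" source
    selector: Triple  # s: a triple pattern

    def select(self, dataset: FrozenSet[Triple]) -> FrozenSet[Triple]:
        """Triples of the dataset that match the selector."""
        return frozenset(t for t in dataset if matches(self.selector, t))


# Illustrative data loosely modeled on the P1 triples of Figure 1.
P1_DATA = frozenset({
    ("t1", "p1", "c1"), ("t1", "p1", "c2"),
    ("c1", "p4", "d1"), ("c2", "p4", "d2"),
})
f1 = Fragment("http://publicEndpoint1/sparql", ("?x", "p1", "?y"))
f9 = Fragment("http://publicEndpoint1/sparql", ("?x", "p1", "c1"))
```

With this data, `f1.select(P1_DATA)` yields the two p1 triples, while `f9.select(P1_DATA)` yields only the triple whose object is c1.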

We restrict fragments to triple pattern fragments as in [9,20]. Hereafter, we consider
that fragments are read-only and data cannot be updated; the fragment synchronization






























Listing 1.1: C1 Endpoint Description

    @prefix sd: <http://www.w3.org/ns/sparql-service-description#> .
    @prefix dc: <http://purl.org/dc/elements/1.1/> .
    @prefix dcterms: <http://purl.org/dc/terms/> .

    [] <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> sd:Service ;
       sd:endpoint <http://consumer1/sparql> ;
       dcterms:hasPart [
         dc:description "Construct where { ?x p1 ?y }" ;
         dcterms:source <http://publicEndpoint1/sparql> ;
         dcterms:hasPart [
           dc:description "Construct where { ?x p2 ?y }" ;
           dcterms:source <http://publicEndpoint1/sparql> ;
           dcterms:hasPart [
             dc:description "Construct where { ?x p3 ?y }" ;
             dcterms:source <http://publicEndpoint2/sparql> ;
           ]
         ]
       ] .

Listing 1.2: Queries Q1 and Q2

    Q1: CONSTRUCT WHERE { ?x1 p1 ?x2 }

    Q2: CONSTRUCT WHERE { ?x1 p4 ?x2 .
                          ?x1 p7 ?x3 }



problem is studied in [9].

Consider the federation in Figure 1a, where five data consumer endpoints (C1-C5)
include fragments (f1-f9) from the public endpoints P1 and P2. Table 1b shows the
SPARQL CONSTRUCT query used as a selector for each fragment. Fragments f1, f2, f4,
and f9 have P1 as “authoritative” source, and fragments f3 and f5-f8 have P2 as
“authoritative” source. The last column presents the consumer endpoints where
fragments are available. To participate in a Fedra federation, data consumers annotate
each fragment exposed through their endpoints with the fragment selector s and the
public endpoint u that provides the data. The vocabulary term sd:endpoint refers to the
SPARQL endpoint that publishes the fragment. The vocabulary term dcterms:hasPart
introduces a fragment description, the vocabulary term dc:description refers to the
SPARQL CONSTRUCT query s, and the vocabulary term dcterms:source specifies u, the
URI of the fragment's “authoritative” source. Listing 1.1 shows the description of the
endpoint C1 of Figure 1a.

For source selection with replicated fragments, we need to define when a fragment
is relevant for answering a query. A fragment is relevant for answering a query if it is
relevant for at least one triple pattern of the query.


Definition 2 (Fragment relevance). A fragment $f = (u, s, \Gamma, M, C)$ is relevant for a
triple pattern $tp$ if the triple pattern evaluated over $\Gamma$, $[\![tp]\!]_{\Gamma}$, is not empty.
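
Definition 2 can be sketched in Python as follows. This is a minimal illustration, assuming that each fragment is given directly as a set of triples and that terms starting with `?` are variables; the fragment contents are loosely modeled on Figure 1, and the function names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]


def evaluate(tp: Triple, triples: Set[Triple]) -> Set[Triple]:
    """Triples that match the triple pattern (variables start with '?')."""
    result = set()
    for triple in triples:
        binding: Dict[str, str] = {}
        ok = True
        for p, t in zip(tp, triple):
            if p.startswith("?"):
                # Repeated variables must bind consistently.
                ok = binding.setdefault(p, t) == t
            else:
                ok = p == t
            if not ok:
                break
        if ok:
            result.add(triple)
    return result


def relevant_fragments(query: List[Triple],
                       fragments: Dict[str, Set[Triple]]) -> Set[str]:
    """A fragment is relevant if at least one triple pattern has matches in it."""
    return {name for name, triples in fragments.items()
            if any(evaluate(tp, triples) for tp in query)}


# Illustrative fragment contents (assumed, loosely following Figure 1).
FRAGMENTS = {
    "f1": {("t1", "p1", "c1"), ("t1", "p1", "c2")},
    "f3": {("d1", "p3", "g1"), ("d2", "p3", "g2")},
    "f4": {("c1", "p4", "d1"), ("c2", "p4", "d2")},
    "f7": {("d1", "p7", "m")},
    "f8": {("d2", "p7", "e")},
    "f9": {("t1", "p1", "c1")},
}
Q1 = [("?x1", "p1", "?x2")]
Q2 = [("?x1", "p4", "?x2"), ("?x1", "p7", "?x3")]
```

Under these assumptions, `relevant_fragments(Q1, FRAGMENTS)` returns f1 and f9, and `relevant_fragments(Q2, FRAGMENTS)` returns f4, f7, and f8.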

Consider queries Q1 and Q2 (cf. Listing 1.2) and the federation of Figure 1. Fragments
f1 and f9 are relevant for query Q1, while fragments f4, f7, and f8 are relevant
for query Q2. We can define two types of containment: containment between SPARQL
endpoints and containment between fragments.


Definition 3 (Endpoint Containment). Let $e_1$ and $e_2$ be the URIs of two SPARQL
endpoints that respectively expose fragments $f_1 = (u_1, s, \Gamma_1, M_1, C_1)$ and
$f_2 = (u_2, s, \Gamma_2, M_2, C_2)$ such that $f_1$ and $f_2$ have the same selector $s$ and the same
“authoritative” source ($u_2 = u_1$). Then all triples in $f_2$ are contained in $f_1$ and vice
versa, i.e., $\Gamma_2 \subseteq \Gamma_1 \wedge \Gamma_1 \subseteq \Gamma_2$. We use the notation $e_1 \sqsubseteq_s e_2$ and
$e_2 \sqsubseteq_s e_1$ to represent the endpoint containment relationship.
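
A small Python sketch of how Definition 3 could be checked from endpoint descriptions alone. It assumes that selectors are compared as canonical strings (variable renaming is not handled) and that each endpoint is described by the set of fragment definitions it exposes; `FragmentDef` and `mutual_containments` are hypothetical names.

```python
from typing import Dict, NamedTuple, Set


class FragmentDef(NamedTuple):
    source: str    # u: "authoritative" source URI
    selector: str  # s: selector, assumed to be in a canonical textual form


def mutual_containments(frags: Dict[str, Set[FragmentDef]],
                        e1: str, e2: str) -> Set[FragmentDef]:
    """Fragment definitions exposed by both endpoints.

    For each returned definition, both endpoints hold the same selector over
    the same source, so containment holds in both directions.
    """
    return frags[e1] & frags[e2]


P1 = "http://publicEndpoint1/sparql"
P2 = "http://publicEndpoint2/sparql"
f1 = FragmentDef(P1, "?x p1 ?y")
f2 = FragmentDef(P1, "?x p2 ?y")
f3 = FragmentDef(P2, "?x p3 ?y")
f4 = FragmentDef(P1, "?x p4 ?y")

# Partial, illustrative description of three consumer endpoints of Figure 1.
FRAGS = {"C1": {f1, f2, f3}, "C2": {f4}, "C3": {f1, f4}}
```

Here `mutual_containments(FRAGS, "C1", "C3")` returns only f1, which mirrors the observation that the f1 triples at C1 and C3 are the same.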



[Figure 2 panels: (a) Query Q3, a SELECT DISTINCT * query; (b) RF for Q3;
(c) Conditions 1-3; (d) Conditions 1-4]


Fig. 2: SSP solutions for query Q3: (a) Query Q3, (b) Relevant fragments for Q3, (c)
Maps D1 and D2 that satisfy Source Selection Problem (SSP) conditions 1-3, (d) Map
D2 that satisfies Source Selection Problem (SSP) conditions 1-4


The f1 triples in consumers C1 and C3 from Figure 1 are the same (by Definition 3):
$C1 \sqsubseteq_s C3$ and $C3 \sqsubseteq_s C1$, where $s$ is the selector of f1. Endpoint containment can
be used to reduce the number of endpoints to contact to answer a query. Fragments f1
and f9 are relevant for query Q1; from the endpoint containments, f1 data is available
through P1, C1, and C3, and f9 data is available through P1 and C5. Therefore, only one
endpoint per fragment needs to be selected to answer the query. A good choice could be
C1, C5 or C3, C5; this reduces the load on the public endpoint P1 and improves the
availability of P1. By contacting only C1, C5 or C3, C5, complete query answers are
obtained because the triple pattern fragment defines a copy of the data source using the
fragment selector. Another type of containment that allows reducing the number of
sources to be contacted is defined based on the fragment selector.

Definition 4 (Fragment Containment). Let $f_1 = (u, s_1, \Gamma_1, M_1, C_1)$ and
$f_2 = (u, s_2, \Gamma_2, M_2, C_2)$ be two fragments that share the same “authoritative” source
$u$, and let $tp$ be a triple pattern. If, for all possible values of $\Gamma_1$ and $\Gamma_2$, the triples
in $\Gamma_1$ that match $tp$ are always also in $\Gamma_2$, i.e., $[\![tp]\!]_{\Gamma_1} \subseteq [\![tp]\!]_{\Gamma_2}$, then,
regarding $tp$, $f_1$ is contained in $f_2$, and we use the notation $f_1 \sqsubseteq f_2$.

Triples of fragment f9 replicated at consumer C5 in Figure 1 are contained in fragment
f1 at C1 and C3 ($f_9 \sqsubseteq f_1$) because f1 and f9 share the same “authoritative” source, and
all the triples that match predicate p1 and object c1 also match predicate p1 in f1.
Using fragment containment, contacting C1 or C3 is enough to answer Q1.
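
One way to decide fragment containment from selectors alone is a triple-pattern subsumption test, sketched below in Python. This is an assumption about how the check could be implemented, not a description of the Fedra code: a selector s1 is taken to be contained in s2 when s1 can be obtained from s2 by consistently replacing the variables of s2. Terms starting with `?` are treated as variables, and the function names are hypothetical.

```python
from typing import Dict, Tuple

TriplePattern = Tuple[str, str, str]


def is_var(term: str) -> bool:
    return term.startswith("?")


def selector_contained(s1: TriplePattern, s2: TriplePattern) -> bool:
    """True if every triple matching s1 necessarily matches s2."""
    substitution: Dict[str, str] = {}
    for a, b in zip(s1, s2):
        if is_var(b):
            # Each variable of s2 must map to a single term of s1.
            if substitution.setdefault(b, a) != a:
                return False
        elif a != b:
            # A constant in s2 must appear identically in s1.
            return False
    return True


def fragment_contained(u1: str, s1: TriplePattern,
                       u2: str, s2: TriplePattern) -> bool:
    """Containment requires the same "authoritative" source."""
    return u1 == u2 and selector_contained(s1, s2)


P1 = "http://publicEndpoint1/sparql"
S_F1 = ("?x", "p1", "?y")   # selector of f1
S_F9 = ("?x", "p1", "c1")   # selector of f9
```

With these selectors, f9 is contained in f1 but not the other way around, and no containment is reported between fragments of different sources.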


Source Selection Problem (SSP). Given a set of SPARQL endpoints $E$; a set of public
endpoints $P$, $P \subseteq E$; the set of fragments contained in each endpoint as a function
$frags : Endpoint \to$ set of $Fragment$; a containment relation among endpoints (given by
Definition 3), $e_i \sqsubseteq_f e_j$ for $f \in frags(e_i) \wedge f \in frags(e_j)$; a containment relation
among fragment selectors (given by Definition 4), $f_j \sqsubseteq f_k$; and a SPARQL query $Q$.
Find a map $D$ such that for each triple pattern $tp$ in $Q$, $D(tp) \subseteq E$ and: 1) For each
endpoint $e$ that may contribute relevant data to answer query $Q$, $e$ is included in $D$,
or $D$ includes another endpoint that contributes at least the same relevant data as $e$.
2) $D(tp)$ contains as few public endpoints as possible. 3) $size(D(tp))$ is minimized for
every triple pattern $tp$ in $Q$. 4) The number of different endpoints used within each
basic graph pattern is minimized.




































Condition 1 states that the selected sources will produce an answer that is as complete
as possible given the set of fragments accessible through the endpoints in E; the answer
may be incomplete if some fragment definitions are missing. Condition 2 ensures that
the availability problems of public endpoints are avoided whenever possible. Condition
3 establishes that the number of selected sources is reduced. Condition 4 aims to reduce
the size of intermediate results and to delegate join execution to endpoints whenever
possible. Even if the public endpoint can provide all the fragments, the use of several
consumer endpoints is preferable. To illustrate these four conditions, consider query
Q3 in Figure 2a and the fragments and endpoints of Figure 1. Table 2b shows the relevant
fragments for the triple patterns of Q3, and the endpoints that provide these fragments.
For example, for the triple pattern ?x1 p1 ?x2, there are two relevant fragments, f1 and
f9. As previously discussed using endpoint containments, contacting C1 or C3 is enough
to answer this triple pattern without contacting the public endpoint. The maps D1 and
D2 in Figure 2c satisfy SSP conditions 1-3: all relevant fragments have been included
directly or through a containment relation, the number of selected endpoints per triple
pattern has been minimized, and no public endpoint has been included in the map.
However, only the map D2 satisfies condition 4, as the number of different endpoints
selected per basic graph pattern has also been minimized (see Figure 2d). Then, joins
are delegated to the selected endpoints, and the size of intermediate results is reduced.
The current state of the art [8,15] works triple pattern by triple pattern and does not
guarantee condition 4.
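
To make conditions 2-4 concrete, the following Python sketch computes, for a candidate map D, the number of selected public endpoints, the total number of selected endpoints, and the number of distinct endpoints per basic graph pattern. It is illustrated on query Q2 of Listing 1.2 with the endpoints of Figure 1; the two candidate maps `D_A` and `D_B` and the function name `score` are hypothetical and introduced only for this example.

```python
from typing import Dict, List, Set, Tuple


def score(D: Dict[str, Set[str]],
          bgps: List[List[str]],
          public: Set[str]) -> Tuple[int, int, List[int]]:
    """Return (public endpoints selected, endpoints selected, endpoints per BGP)."""
    selected_public = sum(len(endpoints & public) for endpoints in D.values())
    selected_total = sum(len(endpoints) for endpoints in D.values())
    # Condition 4 looks at the distinct endpoints used inside each BGP.
    per_bgp = [len(set().union(*(D[tp] for tp in bgp))) for bgp in bgps]
    return selected_public, selected_total, per_bgp


PUBLIC = {"P1", "P2"}
TP1 = "?x1 p4 ?x2"
TP2 = "?x1 p7 ?x3"
BGPS = [[TP1, TP2]]  # Q2 consists of a single basic graph pattern

# Two hypothetical maps: both avoid public endpoints and select one endpoint
# per fragment, but D_B reuses C3 for both triple patterns.
D_A = {TP1: {"C2"}, TP2: {"C3", "C4"}}
D_B = {TP1: {"C3"}, TP2: {"C3", "C4"}}
```

Both maps select no public endpoint and three sources in total, but `D_B` uses two distinct endpoints in the basic graph pattern instead of three, so the join on ?x1 can be partly delegated to C3.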

Source Selection Algorithm. Algorithm 1 sketches the Fedra source selection
algorithm. First, the algorithm pre-selects for each triple pattern in Q the sources that can
be used to evaluate it (lines 2-29). All the endpoints e and their exposed fragments f
are considered (lines 5-27). In line 6, the function canAnswer() is used to determine
whether endpoint e can provide triples from fragment f that match triple pattern tp. An
initial check based on the selector of f and tp is done, and when it is satisfied, a
dynamic check using an ASK query is done to ensure that f is relevant for triple pattern
tp (as in Definition 2). The ASK query avoids considering fragments that are not
relevant for the triple pattern in the case where the triple pattern has constants where the
fragment definition has variables. The relevant endpoints and fragments for query Q3
are given in Table 2b.
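
The two-step check performed by canAnswer can be sketched in Python as follows. This is one possible reading of the description above, not the Fedra implementation: the static check tests that the selector and the triple pattern do not disagree on any constant, and the ASK query is only issued when the triple pattern has a constant where the selector has a variable. The callable `run_ask`, which would send the query to the endpoint, and all function names are assumptions.

```python
from typing import Callable, Tuple

TriplePattern = Tuple[str, str, str]


def is_var(term: str) -> bool:
    return term.startswith("?")


def statically_compatible(selector: TriplePattern, tp: TriplePattern) -> bool:
    """Positions where both sides hold constants must agree."""
    return all(is_var(s) or is_var(t) or s == t for s, t in zip(selector, tp))


def needs_dynamic_check(selector: TriplePattern, tp: TriplePattern) -> bool:
    """True if tp has a constant where the selector has a variable."""
    return any(is_var(s) and not is_var(t) for s, t in zip(selector, tp))


def ask_query(tp: TriplePattern) -> str:
    """Build the ASK query for a triple pattern."""
    return "ASK { %s %s %s }" % tp


def can_answer(selector: TriplePattern, tp: TriplePattern,
               run_ask: Callable[[str], bool]) -> bool:
    if not statically_compatible(selector, tp):
        return False
    if needs_dynamic_check(selector, tp):
        # run_ask is a hypothetical callable that evaluates the ASK query
        # against the endpoint exposing the fragment.
        return run_ask(ask_query(tp))
    return True
```

For example, with the selector `("?x", "<http://example.org/p1>", "?y")`, a triple pattern with predicate `<http://example.org/p2>` is rejected without contacting the endpoint, while a triple pattern with the same predicate and a constant object triggers one ASK query.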

The function subFrag determines whether the data provided by one fragment is also
provided by another fragment. This function takes as arguments the fragments and the
endpoints that provide them, as well as the containment relationships. For each relevant
fragment f, the algorithm determines whether f in endpoint e provides the same data as
already found fragments, at least the same data as already found fragments, or at most
the same data as already found fragments (lines 8-22). Function subFrag is used to test
whether there is a containment in both directions or only in one of them. Accordingly,
the fragment is grouped with the fragments that provide the same data (lines 10-12),
some of the already found fragments are no longer of interest (lines 14-15), or the
fragment is not included (lines 17-18). Between the fragments f1 and f9 selected as
relevant for the first triple pattern of Q3, there is a containment $f_9 \sqsubseteq f_1$; in consequence,
only fragment f1 needs to be selected. Moreover, f1 is provided by endpoints C1, C3,
and P1, and as they provide the same data, each of them can be selected alone to provide
data for this triple pattern, i.e.,


Algorithm 1 Source Selection algorithm

Require: Q: SPARQL Query; E: set of Endpoints; frags: Endpoint → set of Fragment;
         P: set of Endpoint; ⊑_f : Endpoint × Endpoint; ⊑ : BGP × BGP
Ensure: D: map from Triple Pattern to set of Endpoints.

 1: function SourceSelection(Q, E, frags, P, ⊑_f, ⊑)
 2:   triplePatterns ← get triple patterns in Q
 3:   for each tp ∈ triplePatterns do
 4:     fragments ← ∅
 5:     for each e ∈ E ∧ f ∈ frags(e) do
 6:       if canAnswer(e, f, tp) then
 7:         include ← true
 8:         for each fs ∈ fragments do
 9:           (f', e') ← take one element of fs
10:           if subFrag(f, e, f', e', ⊑_f, ⊑) ∧ subFrag(f', e', f, e, ⊑_f, ⊑) then
11:             fs.add((f, e))
12:             include ← false
13:           else
14:             if subFrag(f', e', f, e, ⊑_f, ⊑) then
15:               fragments.remove(fs)
16:             else
17:               if subFrag(f, e, f', e', ⊑_f, ⊑) then
18:                 include ← false
19:               end if
20:             end if
21:           end if
22:         end for
23:         if include then
24:           fragments.add({(f, e)})
25:         end if
26:       end if
27:     end for
28:     G(tp) ← getEndpoints(fragments)
29:   end for each
30:   basicGP ← get basic graph patterns in Q
31:   for each bgp ∈ basicGP do
32:     (S, C) ← minimal set covering instance using bgp ◁ G
33:     C' ← minimalSetCovering(S, C)
34:     for each tp ∈ bgp do
35:       G(tp) ← filter G(tp) according to C'
36:     end for each
37:   end for each
38:   for each tp ∈ domain(G) do
39:     D(tp) ← for each set in G(tp) include one element
40:   end for each
41:   return D
42: end function


Table 2: Changes to fragments due to the execution of lines 5-27 of Algorithm 1 for the
first triple pattern of Q3 (left) and the second triple pattern of Q2 (right)

First triple pattern of Q3 (left):

line   fragments
4      {}
24     { { (f1, C1) } }
11     { { (f1, C1), (f1, C3) } }
11     { { (f1, C1), (f1, C3), (f1, P1) } }

Second triple pattern of Q2 (right):

line   fragments
4      {}
24     { { (f7, C3) } }
24     { { (f7, C3) }, { (f8, C4) } }
11     { { (f7, C3), (f7, P2) }, { (f8, C4) } }
11     { { (f7, C3), (f7, P2) }, { (f8, C4), (f8, P2) } }


they offer alternative sources for the same fragment. Table 2 (left) shows the changes
in fragments for the first triple pattern of Q3. In contrast, between the fragments f7
and f8 selected as relevant for the second triple pattern of Q2, there is no containment
relation; in consequence, both fragments are selected. Table 2 (right) shows the
changes in fragments for the second triple pattern of Q2.
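
The grouping performed in lines 7-25 of Algorithm 1 can be sketched in Python as below. The sketch assumes that the candidates have already passed canAnswer and that subFrag is available as a callable `contained(a, b)` that returns true when the data of candidate a is contained in the data of candidate b; the toy `contained` function, which encodes only the containments discussed in the text, is an assumption of this example.

```python
from typing import Callable, List, Tuple

Candidate = Tuple[str, str]  # (fragment name, endpoint name)


def preselect(candidates: List[Candidate],
              contained: Callable[[Candidate, Candidate], bool]
              ) -> List[List[Candidate]]:
    """Group candidates that provide the same data; drop subsumed ones."""
    groups: List[List[Candidate]] = []
    for cand in candidates:
        include = True
        kept: List[List[Candidate]] = []
        for group in groups:
            representative = group[0]
            forward = contained(cand, representative)
            backward = contained(representative, cand)
            if forward and backward:      # same data: join the group
                group.append(cand)
                include = False
                kept.append(group)
            elif backward:                # group is subsumed by cand: drop it
                continue
            else:
                if forward:               # cand is subsumed by the group
                    include = False
                kept.append(group)
        groups = kept
        if include:
            groups.append([cand])
    return groups


# Toy containment: identical fragments are mutually contained, and f9 ⊑ f1.
STRICT = {("f9", "f1")}


def contained(a: Candidate, b: Candidate) -> bool:
    return a[0] == b[0] or (a[0], b[0]) in STRICT


Q3_TP1 = [("f1", "C1"), ("f1", "C3"), ("f9", "C5"), ("f1", "P1"), ("f9", "P1")]
Q2_TP2 = [("f7", "C3"), ("f8", "C4"), ("f7", "P2"), ("f8", "P2")]
```

Running `preselect` on `Q3_TP1` and `Q2_TP2` ends in the same final states as the last rows of Table 2: one group with the three copies of f1, and two groups for f7 and f8, respectively.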

When the fragment provides non-redundant data, it is included in the selected
fragments (lines 23-24). In the previous examples, this happens once for the first triple
pattern of Q3 and twice for the second triple pattern of Q2. Function getEndpoints in line
28 takes the endpoints of fragments, and when for one subset fs there are endpoints in
$E \setminus P$, all endpoints in P are removed. In the previous examples, the public endpoints
are removed, and the value of G(?x1 p1 ?x2) is {{C1, C3}} and the value of
G(?x1 p7 ?x3) is {{C3}, {C4}}. At the end of the first for loop (lines 3-29), a set whose
elements are sets of endpoints and fragments that can be used to evaluate the triple
pattern is produced for each triple pattern in the query. All the endpoints in the same set
offer the same data for that fragment, and during execution only one of them needs to
be contacted. And
(a) S instance (b) C instance


Fig. 3: Set covering instances of S and C for the query Q2 and the federation given in
Figure 1. For each element in G(tp), one element is included in set S. For each endpoint
in G, one set is included in collection C, and its elements are the elements of S related
to the endpoint


different elements of this resulting set correspond to different fragments that should be
considered in order to obtain an answer that is as complete as possible, modulo the
considered fragments. For queries Q2 and Q3, C1 and C3 provide the same data for the
first triple pattern of Q3, and C3 and C4 provide different data for the second triple
pattern of Q2.
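
The behaviour described for getEndpoints can be sketched as follows. This is a minimal Python illustration that assumes the groups produced by the pre-selection step are lists of (fragment, endpoint) pairs and that the set of public endpoints is passed explicitly; the function name and signature are assumptions.

```python
from typing import List, Set, Tuple

Candidate = Tuple[str, str]  # (fragment name, endpoint name)


def get_endpoints(groups: List[List[Candidate]],
                  public: Set[str]) -> List[Set[str]]:
    """Per group, keep consumer endpoints and fall back to public ones."""
    result = []
    for group in groups:
        endpoints = {endpoint for _, endpoint in group}
        consumers = endpoints - public
        # Public endpoints are kept only when no consumer endpoint offers
        # the fragment.
        result.append(consumers if consumers else endpoints)
    return result


PUBLIC = {"P1", "P2"}
GROUPS_Q3_TP1 = [[("f1", "C1"), ("f1", "C3"), ("f1", "P1")]]
GROUPS_Q2_TP2 = [[("f7", "C3"), ("f7", "P2")], [("f8", "C4"), ("f8", "P2")]]
```

With these inputs, the function returns {{C1, C3}} for the first triple pattern of Q3 and {{C3}, {C4}} for the second triple pattern of Q2, as in the text.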

Next, for each basic graph pattern, a general selection takes place, considering the
pre-selected sources for each triple pattern in the basic graph pattern. This last part can
be reduced to the well-known set covering problem, and an existing set covering
heuristic may be used to perform the procedure indicated in line 33. To use the
set covering heuristic, an instance of the set covering problem is produced in line 32
using bgp ◁ G³. Figure 3a shows the G values obtained after the loop in lines 3-29 has
ended for query Q2, and the S instance for the set covering problem. For each set in
G(tp), one element is included in S; e.g., for the set {C2, C3}, the element $s_{1,1}$ is
included in S. We use subscripts i, j to denote that the element comes from the i-th
triple pattern and is the j-th element coming from this triple pattern. The collection C is
composed of one set for each endpoint that is present in G, and its elements are the
elements of S related to each endpoint. Figure 3b shows the instance of C for this
example. The collection C' obtained in line 33 is {C3, C4}, as the union of C3 and C4 is
equal to S, and there is no smaller subset of C that can achieve this. The instruction in
line 35 removes from each G(tp) the elements that do not belong to C'. In the example,
C2 is removed from G(?x1 p4 ?x2). A last step may be performed to choose among
endpoints that provide the same fragment and to ensure better query processing by
existing federated query engines (lines 38-40). Nevertheless, these alternative sources
could be used to speed up query processing, e.g., by getting a part of the answer from
each endpoint.
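
Lines 32-35 of Algorithm 1 can be illustrated with the Python sketch below. It builds the set covering instance (S, C) from G for one basic graph pattern and solves it with the standard greedy approximation, which is used here as a stand-in and is not necessarily the heuristic used by Fedra. The function names are hypothetical; the G values are those given in the text for Q2.

```python
from typing import Dict, List, Set, Tuple

Element = Tuple[int, int]  # (i, j): j-th set of the i-th triple pattern


def build_instance(G: Dict[str, List[Set[str]]], bgp: List[str]
                   ) -> Tuple[Set[Element], Dict[str, Set[Element]]]:
    """One element of S per set in G(tp); one set in C per endpoint."""
    S: Set[Element] = set()
    C: Dict[str, Set[Element]] = {}
    for i, tp in enumerate(bgp, start=1):
        for j, endpoints in enumerate(G[tp], start=1):
            S.add((i, j))
            for endpoint in endpoints:
                C.setdefault(endpoint, set()).add((i, j))
    return S, C


def greedy_cover(S: Set[Element], C: Dict[str, Set[Element]]) -> List[str]:
    """Greedy approximation: repeatedly pick the endpoint covering the most."""
    uncovered = set(S)
    chosen: List[str] = []
    while uncovered:
        best = max(sorted(C), key=lambda e: len(C[e] & uncovered))
        if not C[best] & uncovered:
            raise ValueError("S cannot be covered by C")
        chosen.append(best)
        uncovered -= C[best]
    return chosen


def filter_sources(G: Dict[str, List[Set[str]]], bgp: List[str],
                   chosen: List[str]) -> Dict[str, List[Set[str]]]:
    """Keep only the chosen endpoints in each set of G(tp)."""
    keep = set(chosen)
    return {tp: [endpoints & keep for endpoints in G[tp]] for tp in bgp}


TP1, TP2 = "?x1 p4 ?x2", "?x1 p7 ?x3"
G_Q2 = {TP1: [{"C2", "C3"}], TP2: [{"C3"}, {"C4"}]}
S, C = build_instance(G_Q2, [TP1, TP2])
cover = greedy_cover(S, C)
```

For this instance the greedy choice picks C3 first and then C4, and filtering removes C2 from G(?x1 p4 ?x2), in line with the example above. In general, the greedy approximation does not guarantee a minimum cover.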


4 Experiments 

We conducted many experiments in different setups to demonstrate the impact of
Fedra on existing approaches; complete results are available at the Fedra web site⁴.
The performance of the FedX and ANAPSID query engines is our baseline. The general
results compare the performance of FedX alone, FedX+DAW, and FedX+Fedra,
3 bgp ◁ G represents function domain restriction, i.e., it takes the elements of map G that
relate elements in bgp to some set of sets of endpoints

4 https://sites.google.com/site/fedrasourceselection 








Dataset                   Version date   # DT        # P
Diseasome                 19/10/2012     72,445      19
Semantic Web Dog Food     08/11/2012     198,797     147
DBpedia Geo-coordinates   06/2012        1,900,004   4
LinkedMDB                 18/05/2010     3,579,610   148

(a) Datasets

Dataset                   ST   2P   3P   4P   2S   3S
Diseasome                 5    4    5    2    5    5
Semantic Web Dog Food     5    7    7    4    5    5
DBpedia Geo-coordinates   5    0    0    0    5    5
LinkedMDB                 5    0    0    0    0    0

(b) Queries sizes and number

Dataset                   J    OP   U    F(R)    L    OB
Diseasome                 19   2    1    11(7)   4    9
Semantic Web Dog Food     24   5    2    8(7)    4    9
DBpedia Geo-coordinates   10   0    0    6(1)    6    6
LinkedMDB                 0    0    0    1(1)    1    2

(c) Queries with operators, modifiers

Dataset                   # FD   # E1F   # E2F
Diseasome                 16     16      120
Semantic Web Dog Food     40     40      780
DBpedia Geo-coordinates   4      4       6
LinkedMDB                 4      4       6

(d) Federations

Table 3: Datasets, queries and federations characteristics. For the datasets: the version, 
number of different triples (# DT), and predicates (# P). For the queries: the number of 
queries with 1 triple pattern (ST), 2, 3 or 4 triple patterns in star shape (S) or path shape 
(P). Also the number of queries with: joins (J), optionals (OP), unions (U), filter (F), 
regex expressions (R), limit (L), and order by (OB). For the federations: the number of 
fragments definitions (FD), endpoints exposing one and two fragments (E1F, E2F) 


and likewise for ANAPSID. Compared to FedX and ANAPSID, Fedra should significantly
reduce the number of selected sources and speed up query execution. Compared to DAW,
we expect Fedra to achieve the same source reduction but without a pre-computed MIPs
index, and to generate fewer intermediate results thanks to endpoint reduction and to
finding join opportunities.

Datasets, Queries and Federations Benchmark: we used the Diseasome, Semantic Web
Dog Food, LinkedMDB, and DBpedia Geo-coordinates datasets. Table 3a shows the
characteristics of the evaluated datasets. We studied the datasets and queries used in [15]⁵.
However, we modified the queries to include the DISTINCT modifier in all of them.
Additionally, the ORDER BY clause was included in the queries with the LIMIT clause,
in order to make them susceptible to a reduction in the set of selected sources without
changing the query answer, and to ensure a semantically unambiguous query answer.
Tables 3b and 3c present the query characteristics. As the federations used in [15] do not
take fragments into account, they were not reused. A federation was set up for each
dataset; each federation is composed of the triple pattern fragments that are relevant
for the studied queries. The federation endpoints offer one or two fragments; endpoints
with two fragments offer the engine opportunities to execute joins. Table 3d shows
the federation characteristics. On average, DAW indexes were computed in 2,513 secs,
and Fedra containments in 32 secs.


Notice that, as Fedra containments depend only on fragment descriptions, they need to 
be updated less frequently than DAW indexes. Virtuoso 6.1.7 (footnote 6) endpoints were 
used. A Virtuoso server was set up to provide all the endpoints as virtual endpoints. It 
was configured with a timeout of 600 secs and a limit of 100,000 tuples. In order to 
measure the size of intermediate results, proxies were used to access the endpoints. 
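
The measurement idea behind these proxies can be sketched as follows. This is only an illustrative Python sketch, not the authors' Java proxy: it assumes that endpoint responses use the standard SPARQL JSON results format, and the names `count_tuples` and `TupleCounter` are hypothetical. A real proxy would also forward the HTTP request and response unchanged.

```python
import json


def count_tuples(body: str) -> int:
    """Count the solution tuples in a SPARQL JSON results document."""
    doc = json.loads(body)
    # SELECT results list their tuples under results.bindings;
    # ASK results only carry a "boolean" member, so they count as 0 tuples.
    return len(doc.get("results", {}).get("bindings", []))


class TupleCounter:
    """Accumulates the number of tuples returned by each endpoint (hypothetical helper)."""

    def __init__(self):
        self.per_endpoint = {}

    def record(self, endpoint: str, body: str) -> int:
        n = count_tuples(body)
        self.per_endpoint[endpoint] = self.per_endpoint.get(endpoint, 0) + n
        return n

    @property
    def total(self) -> int:
        # Size of intermediate results: tuples transferred from all endpoints.
        return sum(self.per_endpoint.values())
```

Summing `total` over all the requests issued for one query gives one possible reading of the intermediate results size reported in the evaluation.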


5 They are available at https://sites.google.com/site/dawfederation 

6 http://virtuoso.openlinksw.com/, November 2013. 

Fig. 4: FedX and Fedra Number of Selected Sources (NSS) when the percentage of 
known containments is 0%, 25%, 50%, 75% and 100%, for the Diseasome federation 


Implementations: FedX 3.0 (footnote 7) and ANAPSID (footnote 8) have been modified to 
call the Fedra and DAW [15] source selection strategies during query processing. Thus, 
each engine can use the selected sources to perform its own optimization strategies. 
Because FedX is implemented in Java, while ANAPSID is implemented in Python, Fedra 
and DAW (footnote 9) were implemented in both Java 1.7 and Python 2.7.3. Thus, Fedra 
and DAW were integrated in FedX and ANAPSID, reducing the performance impact of 
including these new source selection strategies. Proxies were implemented in Java 1.7 
using the Apache HttpComponents Client library 4.3.5 (footnote 10). 

Evaluation Metrics: i) Number of Selected Public Sources (NSPS): the sum of the 
number of public sources that have been selected per triple pattern. ii) Number of 
Selected Sources (NSS): the sum of the number of sources that have been selected per 
triple pattern. iii) Execution Time (ET): the time elapsed between the moment the query 
is posed by the user and the moment the answers are completely produced. It is broken 
down into source selection time (SST) and total execution time (TET). Time is expressed 
in seconds (secs). A timeout of 300 secs has been enforced. Time was measured using the 
bash time command. iv) Intermediate Results (IR): the number of tuples transferred from 
all the endpoints to the query engine during a query evaluation. v) Recall (R): the 
proportion of the results obtained by the underlying engine that are also obtained when 
the proposed strategy is included. 
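
As an illustration of how the counting metrics above can be computed, the following minimal Python sketch assumes that a source selection is represented as a mapping from each triple pattern to the set of sources selected for it. This representation and the function names `nss`, `nsps` and `recall` are assumptions made for the example; they are not taken from the Fedra implementation.

```python
from typing import Dict, Hashable, Iterable, Set


def nss(selection: Dict[str, Set[str]]) -> int:
    """Number of Selected Sources: sources selected per triple pattern, summed."""
    return sum(len(sources) for sources in selection.values())


def nsps(selection: Dict[str, Set[str]], public: Set[str]) -> int:
    """Number of Selected Public Sources: same sum, restricted to public sources."""
    return sum(len(sources & public) for sources in selection.values())


def recall(engine_answers: Iterable[Hashable],
           strategy_answers: Iterable[Hashable]) -> float:
    """Proportion of the engine's answers also obtained with the strategy."""
    baseline = set(engine_answers)
    if not baseline:
        return 1.0  # nothing to lose when the engine alone returns no answer
    return len(baseline & set(strategy_answers)) / len(baseline)


# Two triple patterns; "public" stands for a public endpoint.
selection = {"tp1": {"public", "c1"}, "tp2": {"c1"}}
print(nss(selection))                # 3
print(nsps(selection, {"public"}))   # 1
print(recall([1, 2, 3, 4], [1, 2, 3]))  # 0.75
```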

Impact of the number of containments on Fedra behavior We study the impact 
of the number of containments known during source selection on the number of 
selected sources and on the size of intermediate results. For each query, the set of 
known containments has been set to contain 0%, 25%, 50%, 75%, and 100% of the 
containments concerning the relevant fragments, and queries have been executed in 
these five configurations. Results show that the number of selected public sources is 
equal to the number of triple patterns in the queries when no containment is known. 
However, as soon as some containments are known (25%-100%), this number is reduced 
to zero. Also, the 

7 http://www.fluidops.com/fedx/, September 2014. 

8 https://github.com/anapsid/anapsid, September 2014. 

9 We had to implement DAW as its code is not available. 

10 https://hc.apache.org/, October 2014. 





























Fig. 5: Source selection time (SST) per query type for the Geo-coordinates federation. 
The FedX (F, left) and the ANAPSID (A, right) query engines are combined with Fedra 
(F+F and F+A), and DAW (D+F and D+A) 

number of selected sources is considerably reduced when the Fedra source selection 
strategy is used instead of just the ANAPSID and FedX source selection strategies; see 
the FedX results in Figure 4. ANAPSID results exhibit similar behavior. 

Preservation of the Query Answer The goal of this experiment is to determine the 
impact of the Fedra source selection strategy on query completeness. Queries were 
executed using both the ANAPSID and the FedX query engines, and then using the same 
engines enhanced with the Fedra source selection strategy. Recall was computed and 
was 1.0 in the majority of the cases. In a few cases the recall was considerably reduced, 
but these cases correspond to queries with the OPTIONAL operator using the FedX query 
engine, and the reduction was due to an implementation error for this operator. Fedra 
only discards a relevant source when the data of its relevant fragments are provided by 
another source that was already selected. Therefore, Fedra itself does not reduce the 
recall of the answer; the observed reductions in recall were caused by limitations of 
the query engine implementations. 

Fig. 6: Number of Selected Sources and Total Execution Time (TET) for the 
Geo-coordinates federation and the FedX (F) query engine. FedX is also combined with 
the Fedra and DAW source selection strategies (F+F and D+F). For F, 4 out of 5 queries 
timed out for each query type. For F+F, 4 out of 5 queries timed out for 3STAR queries. 
For D+F, 4 out of 5 queries timed out for 2STAR and 3STAR queries 

Source Selection Time To measure the Fedra and DAW source selection cost, the 
source selection time of each engine with and without Fedra or DAW was measured. 
Results are diverse. For federations with a large number of endpoints, like the 
SWDF federation, the cost of performing the Fedra source selection is considerably 
lower than the cost of contacting all the endpoints using FedX, but similar to the cost 
of using the ANAPSID source selection strategy. On the other hand, the DAW cost is 
similar to the FedX cost, and considerably higher than the ANAPSID cost. For 
federations with a small number of endpoints, like Geo-coordinates (see Figures 5a and 
5b), performing source selection with Fedra is less expensive than performing just FedX 
source selection, but more expensive than performing just ANAPSID source selection. 
Performing source selection with DAW is more expensive than performing just FedX or 
ANAPSID source selection. ANAPSID source selection mostly relies on the endpoint 
descriptions, and avoids contacting endpoints in most cases. FedX source selection, in 
contrast, relies on contacting endpoints to determine which of them can be used to 
obtain data for each triple pattern. Fedra source selection is somewhere in the middle: 
it contacts all the endpoints that are considered relevant according to their 
descriptions to confirm that they can provide relevant data, and has the added cost of 
using the containments to reduce the number of selected sources. DAW source selection 
does not contact sources, but incurs the non-negligible cost of operating on Min-Wise 
Independent Permutations (MIPs) in order to determine the overlapping sources. 
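
To give an intuition of where the MIPs cost comes from, the following Python sketch shows a generic min-wise estimation of the overlap between two sets. It is not DAW's index: random linear hash functions are used as a common approximation of min-wise independent permutations, and the names `make_permutations`, `mips_vector` and `resemblance`, as well as the vector length, are assumptions made for the example.

```python
import hashlib
import random

PRIME = (1 << 61) - 1  # Mersenne prime used as modulus for the linear hashes


def make_permutations(k, seed=0):
    """Draw k random linear functions x -> (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(k)]


def item_id(item: str) -> int:
    """Map an item (e.g. a serialized triple) to a stable integer identifier."""
    return int.from_bytes(hashlib.sha1(item.encode("utf-8")).digest()[:8], "big")


def mips_vector(items, perms):
    """Keep, for every permutation, the minimum permuted identifier of a non-empty set."""
    ids = [item_id(i) for i in items]
    return [min((a * x + b) % PRIME for x in ids) for a, b in perms]


def resemblance(v1, v2) -> float:
    """Estimate the Jaccard resemblance as the fraction of matching positions."""
    return sum(1 for x, y in zip(v1, v2) if x == y) / len(v1)


perms = make_permutations(256, seed=42)
a = {"http://example.org/s%d" % i for i in range(200)}
b = {"http://example.org/s%d" % i for i in range(100, 300)}
# The true resemblance of a and b is 100/300; the estimate is only approximate.
print(resemblance(mips_vector(a, perms), mips_vector(b, perms)))
```

Each vector must be computed over all the items of a source and compared position by position at selection time, which is the kind of work an approach relying on contacting endpoints or on fragment descriptions does not have to pay.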

Execution Time To measure the Fedra and DAW execution time, queries were 
executed using each engine with and without Fedra or DAW. For small federations or 
queries with only one triple pattern, Fedra and DAW achieve a similar reduction in 
execution time. However, for larger federations and queries with more triple patterns, 
the reduction achieved by Fedra is larger than the one achieved by DAW. In all cases, 
the reduction is considerable when the combination of Fedra and the query engine is 
compared to using the engine alone; e.g., Figure 6b shows the results for the 
Geo-coordinates federation and FedX. For star-shaped queries, the use of the Fedra 
source selection strategy has made the difference between timing out and obtaining 
answers. For queries with two triple patterns, this difference is important, as Fedra 
enables FedX to obtain answers in just a few seconds. The difference in execution time 
is a direct consequence of the reduction in selected sources. Further, executing the 
joins in the endpoints whenever possible may reduce the size of intermediate results 
and produce answers sooner. 

Reduction of the Number of Selected Sources To measure the reduction of the number 
of selected sources, source selection was performed using ANAPSID and FedX with 
and without Fedra or DAW. For each query, the sum of the number of selected sources 
per triple pattern was computed, both for all the sources and for just the public 
sources. Figure 6 shows the results for the Geo-coordinates federation and FedX; similar 
results are observed for the other federations and for ANAPSID. The DAW source 
selection strategy exhibits the same reduction in the total number of selected sources. 
Consequently, some of the selected public sources are pruned, but as DAW does not aim 
to reduce the public sources, it does not achieve a consistent reduction of them. On the 
other hand, Fedra takes as input whether each source is public, and as one of its goals 
is to select as few public sources as possible, it is natural to observe such a 
reduction consistently for 
































Fig. 7: Intermediate results size per query type for the Diseasome federation and FedX 
(F): (a) federation with public endpoint, (b) federation without public endpoint. FedX is 
also combined with the Fedra and DAW source selection strategies (F+F and D+F) 

all the query types. The Fedra source selection strategy identifies the relevant 
fragments and endpoints that provide the same data. Only one of them is actually 
selected and, consequently, a huge reduction in the number of selected sources is 
achieved. Moreover, public endpoints are safely removed from the selected sources, as 
their data can be retrieved from other sources. 
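
The pruning described above can be illustrated for a single triple pattern with the following simplified Python sketch. It is not the Fedra algorithm: the containment reasoning is replaced by a hypothetical data key, so that two endpoints with the same key are assumed to provide the same relevant data, and the function name `select_sources` is an assumption made for the example.

```python
from typing import Dict, Set


def select_sources(relevant: Dict[str, str], public: Set[str]) -> Set[str]:
    """Keep one endpoint per group of endpoints that provide the same data.

    relevant maps each relevant endpoint to a key identifying the data it
    provides for the triple pattern (a stand-in for containment reasoning).
    """
    groups: Dict[str, list] = {}
    for endpoint, data_key in relevant.items():
        groups.setdefault(data_key, []).append(endpoint)

    selected = set()
    for endpoints in groups.values():
        # Non-public endpoints sort first; names break ties deterministically.
        endpoints.sort(key=lambda e: (e in public, e))
        selected.add(endpoints[0])
    return selected


relevant = {"public": "f1", "c1": "f1", "c2": "f1", "c3": "f2"}
print(sorted(select_sources(relevant, {"public"})))  # ['c1', 'c3']
```

In this sketch the public endpoint is only kept when no other endpoint provides the same data, which mirrors the goal of selecting as few public sources as possible without losing answers.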

Reduction of the Intermediate Results Size To measure the reduction in intermediate 
results size, queries were executed using proxies that measure the number of tuples 
transmitted from the endpoints to the engines. Additionally, each query was executed 
against the federation with and without the public endpoint. Figure 7 shows the sizes 
of intermediate results for the Diseasome federation and FedX combined with Fedra and 
DAW; similar results were obtained for the other federations and for ANAPSID. Figure 7a 
shows that when the public endpoint is part of the federation, the DAW source selection 
strategy leads to executions with considerably fewer intermediate results. 

Figure 7b shows that when the public endpoint is not part of the federation, the 
Fedra source selection strategy leads to executions with considerably fewer 
intermediate results. The Fedra source selection strategy finds opportunities to execute 
joins in the endpoints, and in most cases this leads to a significant reduction in the 
intermediate results size. These results are a consequence of SSP Condition 4, and 
cannot be systematically achieved by DAW, as it is a triple-wise approach. Nevertheless, 
as DAW source selection does not avoid public endpoints, it may choose to execute all 
triple patterns in the public endpoint, which comes with a huge reduction in the size of 
intermediate results. Figure 7b shows that when this “public endpoint” opportunity to 
execute all triple patterns in one endpoint is removed, the DAW source selection 
strategy does not consistently reduce the intermediate results size. 
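
The effect of executing a join in an endpoint can be illustrated with a small worked example in Python. The data are made up, and the engine-side case assumes a naive strategy in which the engine retrieves the full result of each triple pattern before joining; actual engines such as FedX and ANAPSID use more elaborate join strategies, so the numbers only illustrate the trend.

```python
def join(left, right, var):
    """Hash join of two lists of solution mappings on a shared variable."""
    index = {}
    for row in right:
        index.setdefault(row[var], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[var], [])]


# Made-up results of two triple patterns sharing the variable "d".
tp1 = [{"d": "d1", "name": "a"}, {"d": "d2", "name": "b"},
       {"d": "d3", "name": "c"}, {"d": "d4", "name": "e"}]
tp2 = [{"d": "d1", "gene": "g1"}, {"d": "d3", "gene": "g2"},
       {"d": "d9", "gene": "g3"}]

# Join at the engine: both results are transferred, then joined locally.
engine_side = len(tp1) + len(tp2)

# Join at one endpoint holding both fragments: only joined tuples are transferred.
endpoint_side = len(join(tp1, tp2, "d"))

print(engine_side, endpoint_side)  # 7 2
```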

5 Conclusions and Future Work 

Recent works on replicated fragments spread data and SPARQL processing capabilities 
over data consumers. This opens new opportunities for federated query processing 
by offering new tradeoffs between availability and performance. We presented Fedra, 
a source selection approach that takes advantage of replicated fragment definitions to 
reduce the use of public endpoints, as they are mostly overloaded. Fedra identifies 
and uses opportunities to perform joins in the sources, relieving the query engine of 
performing them and reducing the size of intermediate results. Experimental results 
demonstrate that the number of selected sources remains low even when a high number 
of endpoints and replicated fragments are part of the federations. These results rely 
on the containments induced by replicated fragment definitions. Moreover, selecting 
the same sources for the triple patterns of a Basic Graph Pattern is a strategy that 
significantly reduces the number of intermediate results. Perspectives include the 
dynamic discovery of endpoints providing replicated fragments. This would allow 
federated query engines to expand declared federations at runtime with consumer 
endpoints of interest. Such a mechanism can improve both the data availability and the 
performance of federated queries in Linked Data. 